OpenMP task scheduling strategies for multicore NUMA systems

نویسندگان

  • Stephen Olivier
  • Allan Porterfield
  • Kyle B. Wheeler
  • Michael Spiegel
  • Jan Prins
چکیده

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run time system. Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different scheduling methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. In order to evaluate scheduling strategies, we extended the open-source Qthreads threading library to implement our schedulers, accepting OpenMP programs through the ROSE compiler. We present a comprehensive performance study of diverse OpenMP task parallel benchmarks, comparing seven different task parallel run time scheduler implementations on an Intel Nehalem multisocket multicore system: our hierarchical work stealing scheduler, a fully-distributed work stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the Qthreads fully-distributed scheduler. In addition, we compare our results against the Intel and GNU OpenMP implementations. Hierarchical scheduling in Qthreads is competitive on all benchmarks tested. On five of the seven benchmarks, hierarchical scheduling in Qthreads demonstrates speedup and absolute performance superior to both the Intel and GNU OpenMP run time systems. Our run time also demonstrates similar performance benefits on AMD Magny Cours and SGI Altix systems, enabling several benchmarks to successfully scale to 192 CPUs of an SGI Altix.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Evaluation of OpenMP Task Scheduling Algorithms for Large NUMA Architectures

Current generation of high performance computing platforms tends to hold a large number of cores. Therefore applications have to expose a fine-grain parallelism to be more efficient. Since version 3.0, the OpenMP standard proposes a way to express such parallelism through tasks. Because the task scheduling strategy is implementation defined, each runtime can have a different behavior and effici...

متن کامل

Scheduling Dynamic OpenMP Applications over Multicore Architectures

Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about...

متن کامل

Towards Efficient OpenMP Strategies for Non-Uniform Architectures

Memory Access (NUMA) based processors architectures. In these architectures, analyzing and considering the non-uniformity is of high importance for improving scalability of systems. In this paper, we analyze and develop a NUMA based approach for the OpenMP parallel programming model. Our technique applies a smart threads allocation method and an advanced tasks scheduling strategy for reducing r...

متن کامل

Task Parallel Models Based on Dynamic Data Placement to Reduce NUMA Effects

NUMA (Non-Uniform Memory Access) multicore computers become popular in scientific and industrial fields due to its scalable memory performance. However, large-scale intensive data computing on NUMA architecture are facing up to the challenges in data locality problems called NUMA effects that are caused by the overhead accesses of cross-node data. Our task parallel model bases on the strategy o...

متن کامل

Modeling Memory System Performance of NUMA Multicore-Multiprocessors

The performance of many applications depends closely on the way they interact with the computer’s memory system: Many applications obtain good performance only if they utilize the memory system efficiently. Unfortunately, obtaining good memory system performance is often difficult, as developing memory system-aware (system) software requires a thorough and detailed understanding of both the cha...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IJHPCA

دوره 26  شماره 

صفحات  -

تاریخ انتشار 2012